The feature that makes item response theory (IRT) models the models of choice for many
psychometric data analysts is parameter invariance, the equality of item and examinee 
parameters from different populations. Using the well-known fact that item and examinee 
parameters are identical only up to a set of linear transformations specific to the functional form 
of a given IRT model, violations of these transformations for unidimensional IRT models are 
algebraically investigated and bias coefficients are derived for some violations. Since a lack of 
invariance constitutes item parameter drift (IPD) at the individual item level or item-set level, the 
magnitude and types of biases introduced by IPD along with their impact on examinee true 
scores can be algebraically derived and these connections are demonstrated with results from a 
recently published simulation study (Wells, Subkoviak, & Serlin, 2002). This paper facilitates a 
deeper understanding of different types of lack of parameter invariance and their practical 
consequences for decision-making through a framework that combines analytical, numerical, and 
visual perspectives on parameter invariance as a fundamental property of measurement. 
Bias Coefficients for Lack of Invariance in Unidimensional IRT Models
Item response theory (IRT) is one of the most popular current methodological frameworks 
for modeling response data from assessments. It is used directly in computer adaptive testing,
cognitively diagnostic assessment, and test equating among other applications (e.g., Hambleton,
Swaminathan, & Rogers, 1991; Junker, 1999; Kaskowitz & de Ayala, 2001). Furthermore,
output from IRT models has more recently been incorporated into hierarchical regression models 
for multilevel data (e.g., Adams, Wilson, & Wu, 1997; Fox & Glas, 2001). The versatility of IRT 
models has made them the preferred tool of choice for many psychometric modelers, but beyond 
the flexibility of IRT models it is the often misunderstood feature of parameter invariance that is 
frequently cited in introductory or advanced texts as one of their most important characteristics 
(e.g., Hambleton & Jones, 1993; van der Linden & Hambleton, 1997; Hambleton et al., 1991;
Lord, 1980). Since invariance relates to generalizability across contexts, parameter invariance in 
IRT models allows for the generalizability of inferences across context and thus constitutes a 
fundamental property of measurement. 

In this paper, the mathematical formalization of parameter invariance is used to develop a 
framework for algebraic, numeric, and visual investigations of biases introduced by different 
types of lack of invariance (LOI). The derivations in this paper are presented to clarify facets and 
implications of parameter invariance for a broad and more applied audience. In the term 
‘parameter invariance’, the parameters referred to are both the set of item parameters and the set
of examinee parameters. The word ‘parameter’ indicates that the term refers to population
quantities, which are treated as fixed but unknown (in a frequentist framework) or random but 
unknown (in a Bayesian framework) and whose values are estimated with data collected within a 
random sampling framework. The word invariance indicates that parameter values are identical 
in separate populations, which is commonly of concern when parameters are estimated
repeatedly with different calibration samples that represent subsets of different populations of
interest. Most importantly, parameter invariance denotes an absolute ideal state that holds only
for perfect model fit, and any discussion about whether there are “degrees of invariance” or
whether there is “some invariance” is technically inappropriate (Hambleton et al., 1991).
Moreover, the question of whether there is invariance in a given population is illogical as well,
because invariance requires at least two populations or conditions for parameter comparisons to be
possible and meaningful.

The mathematical relationships that define parameter invariance are of course not novel per 
se and can be found, albeit often more cryptically, in other sources (e.g., Lord, 1980). In
addition, work in score equating, differential item functioning (DIF), and item parameter drift 
(IPD) deals with LOI and the biases introduced thereby (e.g., Donoghue & Isham, 1998). 
However, the literature does not provide simple and widely accessible algebraic work on the 
conditions of parameter invariance and possible violations of these, which is why the work in 
this paper seeks to clarify many of the subtleties of parameter invariance for practitioners and 
theoreticians alike. 

Mathematically, parameter invariance is a simple identity for parameters that are on the 
same scale. Yet the latent scale in IRT models is arbitrary so that unequated sets of model
parameters are invariant only up to a set of linear transformations specific to a given IRT model.
When estimating these parameters in unidimensional IRT models with calibration samples, this
indeterminacy is typically resolved by requiring that the latent indicator $\theta$ be normally
distributed with mean 0 and standard deviation 1 (i.e., $\theta \sim N(0, 1)$). In orthogonal
multidimensional IRT models, the latent scale indeterminacy implies that parameters are
identical up to an orthogonal rotation, a translation transformation, and a single dilation or
contraction. When estimating these parameters with calibration samples, the indeterminacy is
typically resolved by requiring that the multivariate latent indicator $\boldsymbol{\theta}$ be multivariate normally
distributed with mean vector $\mathbf{0}$ and variance-covariance matrix $\mathbf{I}$, where $\mathbf{I}$ is the identity matrix of
appropriate size (i.e., $\boldsymbol{\theta} \sim N(\mathbf{0}, \mathbf{I})$), which is the multidimensional analogue to the
unidimensional case (Davey, Oshima, & Lee, 1996; Li & Lissitz, 2000). Once estimated values
of the parameters for different populations are available on their respective scales, it is of interest 
to determine the type of relationship that exists between them as a yardstick to assess whether the 
same IRT model is likely to hold in both populations (i.e., whether parameter invariance across 
the populations holds). However, the methods that are used to assess a LOI need to be carefully 
chosen as simple indices such as correlation coefficients may miss additive group level effects, 
for example (Rupp & Zumbo, 2002). 
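
As a minimal sketch of why a correlation coefficient can miss an additive group-level effect, the following Python snippet correlates a hypothetical set of item difficulty values with the same values shifted by a constant. The item values, the shift of .4, and the helper name `pearson` are illustrative assumptions and are not taken from any of the studies cited here; both sets of values are assumed to be on a common, already linked scale.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical difficulty values in the first population (assumption).
beta_pop1 = [-1.5, -0.5, 0.0, 0.7, 1.4]
# Every item is harder by the same constant in the second population.
beta_pop2 = [b + 0.4 for b in beta_pop1]

r = pearson(beta_pop1, beta_pop2)
mean_shift = sum(beta_pop2) / len(beta_pop2) - sum(beta_pop1) / len(beta_pop1)

print(f"correlation = {r:.4f}, mean shift = {mean_shift:.2f}")
```

The correlation is perfect although every difficulty value differs between the two sets, which is why an index of linear association alone cannot establish invariance.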

In this paper we use the term bias coefficients while acknowledging that the word ‘bias’ has a
variety of different usages in the statistical and non-statistical literature. In textbooks of statistical 
inference, bias is generally defined as the difference between the expected value of an estimator 
and the quantity it is trying to estimate (Casella & Berger, 1990, p. 303). In the literature on 
differential item functioning (DIF), bias is sometimes referred to as an undesired differential 
functioning of items that is not attributable to ability differences on the latent dimensions the test
is intended to measure. Under this operationalization bias produces an unfair advantage for one 
group of examinees over another as the examinees in both groups possess differing amounts of 
proficiency on the nuisance dimensions (Shealy & Stout, 1993). In this paper, the term bias 
coefficients is used to denote quantities that are derived from differences in model parameters 
due to IPD, because, if IPD goes undetected, the examinees are assigned a score that is different
from the one they should be correctly assigned if the drift were detected. As an additional point
of clarification, all of the following equations involve population quantities only, because the
focus of this paper is not the estimation of biases but the analytical derivation of the idealized
population analogues. Circumventing the estimation process allows for discussions of what can
be considered “best-case” and “worst-case” scenarios with any real data applications being
instantiations of these cases.

To derive bias coefficients, consider the unidimensional two-parameter logistic (2PL) model
for illustrative purposes where examinees are indexed by $i = 1, \ldots, I$, items are indexed by
$j = 1, \ldots, J$, and $P_j(\theta_i)$ is the probability of examinee $i$ responding to item $j$ correctly as a function
of the latent trait $\theta$. The 2PL model can be written as follows (Hambleton, 1989):

$$P_j(\theta_i) = \frac{\exp(\alpha_j(\theta_i - \beta_j))}{1 + \exp(\alpha_j(\theta_i - \beta_j))}, \quad \alpha_j > 0, \quad -\infty < \beta_j, \theta_i < \infty$$

where $\alpha_j$ is the item slope or “item discrimination” parameter, $\beta_j$ is the item location or “item
difficulty” parameter, and $\theta_i$ is the latent predictor or “proficiency” variable. In the following,
parameters from a second population of interest are denoted by a prime ($'$); conceptually, neither
population is considered more ‘important’ in any sense. Thus, they will not be semantically 
distinguished with terms such as ‘reference’ or ‘focal’ population as is done in, for example, the 
literature on differential item functioning (DIF; see Clauser & Mazor, 1998; Donoghue & Isham, 
1998; Zumbo, 1999). 
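
The 2PL response function can be evaluated directly. The following Python sketch is one straightforward way to do so; the function name `icc_2pl` and the example parameter values are illustrative assumptions rather than part of the original derivations.

```python
import math

def icc_2pl(theta: float, alpha: float, beta: float) -> float:
    """Probability of a correct response under the 2PL model.

    Uses 1 / (1 + exp(-x)), which is algebraically identical to
    exp(x) / (1 + exp(x)) with x = alpha * (theta - beta).
    """
    if alpha <= 0:
        raise ValueError("the discrimination parameter must be positive")
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Hypothetical item with discrimination 1.2 and difficulty 0.3 (assumption).
p_at_beta = icc_2pl(theta=0.3, alpha=1.2, beta=0.3)  # examinee located at the item difficulty
p_above = icc_2pl(theta=1.3, alpha=1.2, beta=0.3)    # examinee one unit above the item difficulty

print(f"P at theta = beta: {p_at_beta:.3f}; P one unit above: {p_above:.3f}")
```

An examinee located exactly at the item difficulty has a response probability of .5, which is the inflection point of the item characteristic curve.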

For parameters in the 2PL to be invariant in the populations of interest, one simply requires
$\alpha'_j = \alpha_j$, $\beta'_j = \beta_j$, and $\theta'_i = \theta_i$ to hold jointly for all items and examinees that are relevant to the
practical context at hand if the parameters are linked onto the same scale. Due to the
indeterminacy of the latent scale for $\theta$, the above identities are equivalent to the following
equations for unlinked scales:

$$\alpha'_j = \delta^{-1}\alpha_j$$

$$\beta'_j = \varepsilon + \delta\beta_j$$

$$\theta'_i = \varepsilon + \delta\theta_i$$

where $\varepsilon$ and $\delta$ are non-zero real numbers. Mathematically, parameters fail to be invariant if at
least one of these equations does not hold for at least one item or examinee in the populations of 
interest depending on which parameters are investigated for invariance. The above equations 
represent restrictive kinds of linear transformations, which is why it is inappropriate to compare 
parameter estimates from different calibrations with indices that measure linear association only. 
Hence, considerations about invariance need to include considerations of item sets as well as of 
individual items (e.g., Donoghue & Isham, 1998; Zumbo, 2003). To understand the types of
biases that are possible under a LOI, it is insightful to consider the impact of different violations 
of the conditions above on the response probabilities. 
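
The scale indeterminacy described above can be checked numerically. The sketch below assumes the usual 2PL linking relations (discrimination divided by the scale factor, difficulty and proficiency shifted and rescaled by the same constants) and uses hypothetical parameter values; the function names `icc_2pl` and `relink` are illustrative and the scale factor is taken to be positive so that the discrimination stays positive.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def relink(theta, alpha, beta, eps, delta):
    """Express (theta, alpha, beta) on a second, linearly related latent scale."""
    return eps + delta * theta, alpha / delta, eps + delta * beta

# Hypothetical examinee, item, and linking constants (assumptions).
theta, alpha, beta = 0.8, 1.1, -0.2
eps, delta = 0.5, 2.0

theta2, alpha2, beta2 = relink(theta, alpha, beta, eps, delta)
p_original = icc_2pl(theta, alpha, beta)
p_relinked = icc_2pl(theta2, alpha2, beta2)

print(f"original scale: {p_original:.6f}; relinked scale: {p_relinked:.6f}")
```

The two probabilities coincide because the product of discrimination and location difference is unchanged by the transformation, which is the sense in which parameters are invariant only up to these linear transformations.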

In generic terms, the linked parameters from the first and second population are related by
$\alpha' = f(\alpha)$, $\beta' = g(\beta)$, and $\theta' = h(\theta)$ and are invariant only if the transformation functions
$f(\cdot)$, $g(\cdot)$, and $h(\cdot)$ are identity functions for all items and examinees; otherwise, they fail to be
invariant. For the sake of simplicity, the following examples of parameter invariance will be
restricted to item parameters and will consider only linear transformation functions for $\alpha$ and $\beta$;
the derivations for $\theta$ are very similar to those for $\beta$ since the two parameters are on the same
scale even though the meanings behind these invariance investigations are very different. Since
each linear relationship is represented by a line with an intercept and a slope, there are three
cases to consider for each item parameter.
For $\alpha_j$ we have

$$\alpha'_j = \alpha_j + \tau_j \qquad \text{(I)}$$

$$\alpha'_j = \omega_j\alpha_j \qquad \text{(II)}$$

$$\alpha'_j = \omega_j\alpha_j + \tau_j \qquad \text{(III)}$$

where $\omega_j$ and $\tau_j$ are non-zero real numbers.

For $\beta_j$ we similarly have

$$\beta'_j = \beta_j + \kappa_j \qquad \text{(IV)}$$

$$\beta'_j = \lambda_j\beta_j \qquad \text{(V)}$$

$$\beta'_j = \lambda_j\beta_j + \kappa_j \qquad \text{(VI)}$$

where $\kappa_j$ and $\lambda_j$ are non-zero real numbers. Note that these six cases are not distinguishable for a
given item. That is, if an item parameter value has drifted and only the drifted value is observed
- as is generally the case when we work with estimates - then there is exactly one real-valued
constant $\tau_j$, one real-valued factor $\omega_j$, and an infinite number of real-valued pairs $\{\tau_j, \omega_j\}$ that
could have given rise to the transformed value. However, if a transformation applies to sets of
items, a distinction between the above cases is crucial as the biases under different
transformations are of different form and magnitude across the drifted items. The six basic cases
(I) - (VI) lead to a total number of 15 cases if joint violations of invariance in $\alpha_j$ and $\beta_j$ are
considered. However, they will not all be described in detail because some cases are
combinations of the six basic cases and follow logically from those. Hence, in the following
section, only the six basic cases are used to express biases first on the logit scale. The section
after that then shows how these biases can be translated into biases on the probability scale to
clearly highlight their practical utility as differences in response probabilities and related true
scores are the focus of practical decision-making.
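
The point that the cases cannot be told apart for a single item can be made concrete with a small Python sketch. The original and drifted discrimination values below are hypothetical, and the variable and function names are illustrative assumptions.

```python
# Hypothetical original and drifted discrimination values for one item (assumption).
alpha, alpha_drifted = 1.0, 1.5

# Case (I): exactly one additive constant reproduces the drifted value.
tau_case_i = alpha_drifted - alpha

# Case (II): exactly one multiplicative factor reproduces the drifted value.
omega_case_ii = alpha_drifted / alpha

def tau_for_case_iii(omega):
    """Additive constant that pairs with a given factor under case (III)."""
    return alpha_drifted - omega * alpha

# Case (III): any factor can be paired with a matching constant,
# so infinitely many pairs are compatible with the same drifted value.
pairs_case_iii = [(omega, tau_for_case_iii(omega)) for omega in (0.5, 1.2, 2.0)]

print("case (I):  tau =", tau_case_i)
print("case (II): omega =", omega_case_ii)
for omega, tau in pairs_case_iii:
    print(f"case (III): omega = {omega}, tau = {tau:.2f}")
```

All of these transformations map the original value onto the same drifted value, so only a pattern across a set of items can separate the cases.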
Bias on the Logit Scale

For some cases, the biases that are introduced by violations (I) - (VI) can be compactly
written with coefficients on the scale that is defined by the link function. The logit scale was
chosen for analytical convenience but any other transformation with appropriate properties (e.g.,
the probit transformation) will technically work as well. We will present only the simple cases (I)
and (IV) below and have collected the remaining four cases in the Appendix. For each case (a)
the new relationship between the parameter values and (b) the introduced bias on the logit scale
will be presented. The bias coefficients will then be interpreted, but it is crucial to note that
the interpretation is with respect to the logit scale and does not necessarily mirror the
interpretation that would be appropriate on the probability scale. Since most practitioners are
probably more interested in the implications of biases for response probabilities and test scores,
these will be discussed in a later section and several biases on the probability scale will be
interpreted there. The following description therefore primarily highlights succinctly the
interrelationships between the parameter transformation function (i.e., the type of LOI) and the
logit scale formulation of the two-parameter kernel.

Case (I) - Non-zero intercept for $\alpha'$

For non-zero real numbers $\delta$ and $\tau_j$,

(a) $\alpha'_j = \delta^{-1}(\alpha_j + \tau_j) = \delta^{-1}\alpha_j + \delta^{-1}\tau_j$

(b) $\text{logit}[P'_j(\theta'_i)] = (\alpha_j + \tau_j)(\theta_i - \beta_j) = \alpha_j(\theta_i - \beta_j) + \tau_j(\theta_i - \beta_j) = \text{logit}[P_j(\theta_i)] + B^{(I)}_{ij}$

where $B^{(I)}_{ij} = \tau_j(\theta_i - \beta_j)$ is an additive bias coefficient whose absolute magnitude depends on
the location difference $\theta_i - \beta_j$ and $\delta$ is the global transformation parameter - hence, no subscript -
required to link scales. Therefore, for a given item, the introduced logit-scale bias is larger in
absolute magnitude for an examinee whose ability is very different from the difficulty of the item
than for an examinee whose ability level is closer to the difficulty of the item. No bias exists for
examinees whose ability level is identical to the item difficulty.
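
The logit-scale bias for an additive drift in the discrimination parameter can be verified numerically. The following sketch works with linked parameters, uses hypothetical item and drift values, and the function names are illustrative assumptions.

```python
def logit_2pl(theta, alpha, beta):
    """Logit of the 2PL response probability."""
    return alpha * (theta - beta)

def bias_case_i(theta, beta, tau):
    """Additive logit-scale bias for a discrimination drift of tau."""
    return tau * (theta - beta)

# Hypothetical item and drift values (assumptions).
alpha, beta, tau = 1.0, 0.0, 0.5

rows = []
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    observed = logit_2pl(theta, alpha + tau, beta) - logit_2pl(theta, alpha, beta)
    rows.append((theta, observed, bias_case_i(theta, beta, tau)))

for theta, observed, predicted in rows:
    print(f"theta = {theta:+.1f}  logit difference = {observed:+.2f}  tau * (theta - beta) = {predicted:+.2f}")
```

The difference in logits matches the bias coefficient at every location, vanishes where the examinee is located at the item difficulty, and grows in absolute value with the location difference.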

Case (IV) - Different intercept for $\beta'$

For non-zero real numbers $\varepsilon$, $\delta$, and $\kappa_j$,

(a) $\beta'_j = \varepsilon + \delta(\beta_j + \kappa_j) = (\varepsilon + \delta\kappa_j) + \delta\beta_j$

(b) $\text{logit}[P'_j(\theta'_i)] = \alpha_j(\theta_i - (\beta_j + \kappa_j)) = \alpha_j(\theta_i - \beta_j) - \alpha_j\kappa_j = \text{logit}[P_j(\theta_i)] + B^{(IV)}_j$

where $B^{(IV)}_j = -\alpha_j\kappa_j$ is an additive bias coefficient whose magnitude depends, for each item, on its
discrimination parameter and $\varepsilon$ and $\delta$ are the global transformation parameters - again, no
subscript - required to link scales. Therefore, items with higher discrimination values will have a
larger logit-scale bias independent of the location difference between examinee and item, which
is actually an accurate description of the bias on the probability scale for this case as well.
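
A corresponding numerical check for an additive drift in the difficulty parameter is sketched below. The discrimination values, examinee locations, and drift amount are hypothetical, and the function name `logit_2pl` is an illustrative assumption.

```python
def logit_2pl(theta, alpha, beta):
    """Logit of the 2PL response probability."""
    return alpha * (theta - beta)

beta, kappa = 0.0, 0.4  # hypothetical difficulty and drift (assumptions)

checks = []
for alpha in (0.5, 1.0, 2.0):
    for theta in (-2.0, 0.0, 2.0):
        observed = logit_2pl(theta, alpha, beta + kappa) - logit_2pl(theta, alpha, beta)
        checks.append((alpha, theta, observed, -alpha * kappa))

for alpha, theta, observed, predicted in checks:
    print(f"alpha = {alpha:.1f}  theta = {theta:+.1f}  logit difference = {observed:+.2f}  -alpha * kappa = {predicted:+.2f}")
```

For a fixed discrimination value the logit difference is the same at every examinee location, and it scales with the discrimination value, as the bias coefficient states.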

In all cases it is clear that the biases result in differences in item characteristic curves (ICCs), 
which equal differences in response probabilities for all or almost all examinees. But since the 
logit transformation is non-linear, the effects of biases on the logit and probability scales are 
different and the additivity of bias on the logit scale is not preserved on the probability scale. It is 
thus necessary to translate the logit-scale biases into probability-scale biases. The following 
section discusses the practical utility of the bias coefficients for the estimation of response 
probabilities and true scores and shows how these results are useful for the study of IPD. 

Bias on the Probability Scale 

It is possible to use the above formulations to analytically compute differences in response 
probabilities at the population level as is done empirically in studies of IPD for calibration 
samples (e.g., Wells et al., 2002; see also Donoghue & Isham, 1998). Conceptually, IPD is
typically defined as the differential shift of item parameters over time (Goldstein, 1983), which is 
often attributed to educational, technological, or cultural changes (Bock, Muraki, &
Pfeiffenberger, 1988). Mathematically, it is readily seen that IPD represents LOI at the item level
where IPD in either $\alpha$ or $\beta$ leads to a change in the respective parameter value with the form of
the exact transformation from $\alpha$ to $\alpha'$ or $\beta$ to $\beta'$ unknown. Hence, one way to represent IPD at
the item level is

$$\alpha'_j = \alpha_j + \tau_j \qquad \text{(A)}$$

$$\beta'_j = \beta_j + \kappa_j \qquad \text{(B)}$$

where $\tau_j > -\alpha_j$ and $\kappa_j \neq 0$, with the first inequality ensuring that $\alpha'_j > 0$. In other
words, all cases (I) - (VI) are cases of IPD but the simplest way to simulate drift and to
analytically investigate it is by casting it as an additive formulation. Since graphical comparisons
of ICCs are made on the probability scale it is helpful to translate the above statements into bias
statements on that scale. To combine the discussion for both cases into one, consider a general
additive bias $\phi$ on the logit scale where $\phi$ is any non-zero real number:

$$\text{logit}[P'_j(\theta'_i)] = \alpha_j(\theta_i - \beta_j) + \phi = \text{logit}[P_j(\theta_i)] + \phi.$$

On the probability scale, this is written as

$$P'_j(\theta'_i) = \frac{\exp[\alpha_j(\theta_i - \beta_j)]\exp[\phi]}{1 + \exp[\alpha_j(\theta_i - \beta_j)]\exp[\phi]} = \frac{\exp[\alpha_j(\theta_i - \beta_j)]}{\exp[-\phi] + \exp[\alpha_j(\theta_i - \beta_j)]},$$

which can be compared to the response function with item parameters that have not drifted,

$$P_j(\theta_i) = \frac{\exp[\alpha_j(\theta_i - \beta_j)]}{1 + \exp[\alpha_j(\theta_i - \beta_j)]}.$$

A few basic algebraic steps result in the following relationships:

$$\phi < 0 \Rightarrow P'_j(\theta'_i) < P_j(\theta_i) \qquad \text{(R1)}$$

$$\phi = 0 \Rightarrow P'_j(\theta'_i) = P_j(\theta_i) \qquad \text{(R2)}$$

$$\phi > 0 \Rightarrow P'_j(\theta'_i) > P_j(\theta_i) \qquad \text{(R3)}$$

where the arrow denotes an implication. In other words, if the additive logit-scale bias is 
positive, the probability under the drifted parameters will be positively biased; if it is negative, it 
will be negatively biased; otherwise, the two probabilities will be identical. The relationships 
(R1) - (R3) are not equivalences, however, because differences in response probabilities can
have many causes only one of which is an additive bias on the logit scale. 
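
The three sign relationships can be confirmed numerically. The sketch below evaluates the drifted response probability in the second algebraic form given above and compares it with the non-drifted probability; the parameter values and function names are illustrative assumptions.

```python
import math

def p_original(theta, alpha, beta):
    """Non-drifted 2PL response probability."""
    x = alpha * (theta - beta)
    return math.exp(x) / (1.0 + math.exp(x))

def p_drifted(theta, alpha, beta, phi):
    """Response probability after adding phi on the logit scale."""
    x = alpha * (theta - beta)
    return math.exp(x) / (math.exp(-phi) + math.exp(x))

# Hypothetical examinee and item (assumptions).
theta, alpha, beta = 0.5, 1.0, 0.0

results = {phi: p_drifted(theta, alpha, beta, phi) - p_original(theta, alpha, beta)
           for phi in (-0.5, 0.0, 0.5)}

for phi, diff in results.items():
    print(f"phi = {phi:+.1f}  drifted minus original probability = {diff:+.4f}")
```

A negative logit-scale bias lowers the response probability, a positive one raises it, and a bias of zero leaves it unchanged.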

Consider the Wells et al. (2002) study for illustrative purposes. The authors simulated drift in 
the population values of the item difficulty and discrimination parameters in a 2PL. Only 
positive amounts of drift were considered and the effect of item parameter drift on the estimation 
of examinee ability parameters was estimated under 48 conditions: Test length (2 levels) x 
sample size (2 levels) x type of drift (3 levels) x number of drift items (4 levels). More 
specifically, if an item was selected to display item parameter drift, the authors increased either 
the discrimination parameter by .5 or the difficulty parameter by .4 or both simultaneously by .5 
and .4 respectively. 

It is immediately clear that increasing a difficulty parameter by some positive number leads 
to an ICC that is shifted to the right and that increasing a discrimination parameter by some 
positive number leads to an unchanged inflection point but a steeper slope. Yet, in addition to the 
conceptual understanding, it is possible to quantify these changes more precisely and the bias 
coefficients allow us to do just that. When the authors changed an item discrimination parameter 
value by .5, they introduced a bias of

$$B^{(I)}_{ij} = \tau_j(\theta_i - \beta_j) = .5(\theta_i - \beta_j)$$

according to case (I). For drifted items, this results in ICC segments that are shifted upward for
positive bias (R3), which occurs when $\theta_i > \beta_j$, ICC segments that are shifted downward for
negative bias (R1), which occurs when $\theta_i < \beta_j$, and an identical ICC value for no bias (R2) at
$\theta_i = \beta_j$. This pattern was observed (see Wells et al., 2002, Figure 1a, p. 80) and plotted with
respect to the estimated true score, which is computed as the sum of the ICCs over all items in
the test:

$$T(\theta_i) = \sum_{j=1}^{J} P_j(\theta_i)$$

The resulting curve that traces the true score as a function of the latent indicator $\theta$ is called the
test characteristic curve (TCC) and it was seen that the overall shift in the TCC was relatively 
minimal, because only a few items exhibited drift in each of the design conditions in the study. 
This also stems from the fact that the differences in response probabilities are actually relatively 
minor. To illustrate this, let us formally denote the difference in response probabilities by $\Delta_{ij}$,

$$\Delta_{ij} = P_j(\theta_i) - P'_j(\theta'_i),$$

with $-1 < \Delta_{ij} < 1$.
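
The true-score computation described above can be sketched in a few lines of Python. The five-item test, the choice of drifted item, and the function names are hypothetical and serve only to show that the shift in the test characteristic curve equals the sum of the item-level differences, here a single one.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def tcc(theta, items):
    """Test characteristic curve: sum of the ICCs over all items."""
    return sum(icc_2pl(theta, alpha, beta) for alpha, beta in items)

# Hypothetical five-item test as (discrimination, difficulty) pairs (assumption).
items = [(0.8, -1.0), (1.0, -0.5), (1.2, 0.0), (1.0, 0.5), (0.9, 1.0)]

# One item drifts in its discrimination parameter by .5.
drifted_items = list(items)
drifted_items[2] = (items[2][0] + 0.5, items[2][1])

theta = 1.0
tcc_difference = tcc(theta, items) - tcc(theta, drifted_items)
item_difference = icc_2pl(theta, *items[2]) - icc_2pl(theta, *drifted_items[2])

print(f"TCC difference at theta = {theta}: {tcc_difference:+.4f}")
```

Because only one of the five items drifts, the true-score difference at any location is bounded by 1 in absolute value and equals the response probability difference on that item.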

Table A1 shows the $\Delta_{ij}$ values (cell entries) for an $\alpha$-drift of .5 as a function of the original
discrimination value of an item (row value) and the location difference between an examinee and
an item on the $\theta$ scale, $\theta_i - \beta_j$ (column value). For example, take an item with an original
discrimination value of .75 and an examinee whose location on the latent scale is .5 units above
the location of the item (i.e., $\theta_i - \beta_j = .5$). The bias that gets introduced for this examinee on
this item under drift of the discrimination parameter manifests itself in a difference in response
probabilities of only $\Delta_{ij} = -.0586883$. In other words, the response probability under the drifted
discrimination parameter is about 6 percentage points higher than under the non-drifted parameter. It can be seen
in this table that most $\Delta_{ij}$ values are between .05 and .10 in absolute value. In other words, between 10 and 20 items
with an $\alpha$-drift of this magnitude are necessary to result in a true score difference of only 1 point,
a difference that would probably be considered rather trivial for most practical circumstances.
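
The worked example can be reproduced directly. The sketch below takes the difference as the non-drifted minus the drifted response probability, which matches the sign of the reported value; the function names are illustrative assumptions.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def delta_alpha_drift(alpha, location_diff, tau=0.5):
    """Non-drifted minus drifted response probability for a discrimination drift of tau.

    location_diff is theta - beta; the item difficulty is set to 0 without loss of generality.
    """
    return icc_2pl(location_diff, alpha, 0.0) - icc_2pl(location_diff, alpha + tau, 0.0)

# Original discrimination .75, examinee .5 units above the item location.
worked_example = delta_alpha_drift(alpha=0.75, location_diff=0.5)

# Number of items with this difference needed for a true-score change of 1 point.
items_for_one_point = 1.0 / abs(worked_example)

print(f"Delta = {worked_example:.7f}")
print(f"items needed for a 1-point true-score difference: {items_for_one_point:.1f}")
```

With a difference of roughly .06 per item, about 17 such items are needed to move the true score by one point, in line with the range of 10 to 20 items stated above.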

When the authors changed an item difficulty parameter value by .4, they introduced a
negative bias of $B^{(IV)}_j = -\kappa_j\alpha_j = -.4\alpha_j < 0$, according to case (IV), where the inequality stems
from the fact that item discrimination values are always positive. For drifted items, this results in
ICCs that are shifted to the right according to (R1) independent of the values of $\theta_i$ and $\beta_j$, which
was observed with again relatively minimal effects in terms of the TCC (see Wells et al., 2002,
Figure 2a, p. 82). Again, the $\Delta_{ij}$ values for a variety of item discrimination parameters and
location differences can be computed (see Table A2) and, again, most of the $\Delta_{ij}$ values for
moderately discriminating items and moderate location differences are between .05 and .10 albeit
some cases with higher values can be observed. Just as before, this means that for the majority of
cases between 10 and 20 items with a $\beta$-drift of this magnitude are required to produce a true-score
change of 1 point, a relatively minor effect.
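
Differences in response probabilities under a difficulty drift of .4 can be computed in the same way. The grid of discrimination values and location differences below is a hypothetical choice and is not meant to reproduce the layout of the appendix table; the function names are illustrative assumptions.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def delta_beta_drift(alpha, location_diff, kappa=0.4):
    """Non-drifted minus drifted response probability for a difficulty drift of kappa.

    location_diff is theta - beta for the non-drifted item, whose difficulty is set to 0.
    """
    return icc_2pl(location_diff, alpha, 0.0) - icc_2pl(location_diff, alpha, kappa)

alphas = (0.5, 1.0, 1.5, 2.0)                 # hypothetical discrimination values
location_diffs = (-2.0, -1.0, 0.0, 1.0, 2.0)  # hypothetical location differences

grid = {(a, d): delta_beta_drift(a, d) for a in alphas for d in location_diffs}

for a in alphas:
    row = "  ".join(f"{grid[(a, d)]:.4f}" for d in location_diffs)
    print(f"alpha = {a:.1f}: {row}")
```

Every difference is positive because a harder item lowers the response probability at every location, which is the probability-scale counterpart of the uniformly negative logit-scale bias.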

Finally, when the authors changed both the item discrimination parameter value by .5 and the
item difficulty parameter value by .4, they introduced multiple biases. Even though conditions
for when upward and downward shifts of the ICCs occur can be formally stated, those conditions
are relatively cumbersome to present and are omitted here. In the study, the authors report that
the TCCs cross at a value $\theta_0$ where $\theta_0 > \beta_j$. The effects of the biases as seen in the TCCs were
again relatively minimal for most values of $\theta$ but now started to increase in magnitude for
specific sub-regions of the $\theta$ space compared to the previous scenarios (see Wells et al., 2002,
Figure 3a, p. 83). Table A3 shows the $\Delta_{ij}$ values for this scenario and it can be seen that these are
still often between .05 and .10 but that now there are also quite a few values in the range of .10 to
.15 with some even reaching .20. Thus, even though between 10 and 20 items are required to
produce a true score difference of 1 point for many cases under this joint $\alpha$- and $\beta$-drift, only
around 5 to 7 items are required in other cases.
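
For a single item, the joint drift can be explored with the short sketch below. Setting the drifted and non-drifted logits equal gives a crossing point of the two ICCs at a location difference of kappa * (alpha + tau) / tau; this single-item result is derived here for illustration only and is not a statement about the crossing point of the TCCs reported in the study. The function names and the example item are illustrative assumptions.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def delta_joint(alpha, location_diff, tau=0.5, kappa=0.4):
    """Non-drifted minus drifted probability under joint drift (non-drifted difficulty set to 0)."""
    return icc_2pl(location_diff, alpha, 0.0) - icc_2pl(location_diff, alpha + tau, kappa)

def crossing_location(alpha, tau=0.5, kappa=0.4):
    """Location difference at which the drifted and non-drifted ICCs of one item intersect."""
    return kappa * (alpha + tau) / tau

alpha = 1.0  # hypothetical original discrimination value
cross = crossing_location(alpha)

print(f"ICCs cross at theta - beta = {cross:.2f}")
for d in (-2.0, 0.0, cross, 2.0):
    print(f"theta - beta = {d:+.2f}  Delta = {delta_joint(alpha, d):+.4f}")
```

Below the crossing point the non-drifted probability is the larger one, above it the drifted probability is larger, and the crossing point lies above the original item difficulty.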


All three scenarios show that when the pattern of introduced biases is expressed with respect
to response probabilities it appears quite complex due to the curvature and asymptotic behavior
of the ICCs. It is possible to plot the $\Delta_{ij}$ values to illustrate this complex behavior more closely.
Figure 1 shows the $\Delta_{ij}$ surface and contour plots for the $\alpha$-drift of .5, item discrimination values
of non-drifted items between 0 and 2, and location differences between -2 and 2 to match the
structure of Table A1 while utilizing more grid points. Note that for the surface plot $\Delta_{ij}$ is labeled
‘Delta’, the location difference is labeled ‘Theta-Beta’, and the non-drifted discrimination values
are labeled ‘Alpha’. Furthermore, the orientation of the contour plot matches the orientation of
the surface plot so that the horizontal axis represents the location difference values, the vertical
axis represents the item discrimination values, and the contour lines and shades represent the $\Delta_{ij}$
values with lighter shades corresponding to higher $\Delta_{ij}$ values and darker shades corresponding to
lower $\Delta_{ij}$ values.




(a) Surface Plot for $\alpha$-drift (b) Contour Plot for $\alpha$-drift

Figure 1. Surface and contour plots of $\Delta_{ij}$ for $\alpha' = \alpha + .5$.

To understand these plots, note that when an item discrimination parameter drifts the slope of the
ICC for the item with the drifted parameter is steeper, which results in increasing differences in
response probabilities in both directions from the inflection point for some range of $\theta$ values
followed by decreasing differences as the original ICC and the ICC under drift approach their
asymptotes. As an example of this behavior, Figure 2 shows a plot of a drifted item with original
discrimination value $\alpha = 1$, discrimination value $\alpha' = \alpha + .5 = 1.5$ after drift, and difficulty value
$\beta = 0$. Note that the latent trait $\theta$ is labeled ‘Theta’ and that $P_j(\theta_i)$ is labeled ‘Probability’.

Figure 2. ICCs for item with $\alpha$-drift of .5.

The differences are positive to the left of the inflection point and negative to the right of the
inflection point as a result of how $\Delta_{ij}$ was defined here. The surface and contour plots of Figure 1
show graphically how differences are largest in absolute magnitude for the least discriminating
items and smallest for the most discriminating items for the range of location differences
considered. This makes sense, because if the slope of an ICC is already rather steep without drift
present (i.e., when an item is already highly discriminating) then a further increase in slope will
have relatively little impact on response probability differences. This implies that if items are of
at least reasonable discriminatory power for a given population (e.g., if they have an $\alpha_j$ value of
at least 1) biases are not as extreme.
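
The dependence of the largest difference on the original discrimination value can be checked with a simple grid search. The grid of location differences between -2 and 2, the selected discrimination values, and the function names are illustrative assumptions.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Location differences from -2 to 2 in steps of .1.
location_grid = [k / 10 for k in range(-20, 21)]

def max_abs_delta(alpha, tau=0.5):
    """Largest absolute probability difference over the grid for a discrimination drift of tau."""
    return max(abs(icc_2pl(d, alpha, 0.0) - icc_2pl(d, alpha + tau, 0.0))
               for d in location_grid)

peaks = {alpha: max_abs_delta(alpha) for alpha in (0.25, 0.5, 1.0, 1.5, 2.0)}

for alpha, peak in peaks.items():
    print(f"alpha = {alpha:.2f}  largest |Delta| on the grid = {peak:.4f}")
```

Over this range the largest absolute difference shrinks as the original discrimination value increases, which agrees with the pattern described for the surface and contour plots.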
Figure 3 shows the $\Delta_{ij}$ surface and contour plots for the $\beta$-drift of .4, item discrimination
values of non-drifted items between 0 and 2, and location differences between -2 and 2 to match
the structure of Table A2 while again utilizing more grid points. The labeling corresponds to that
of Figure 1.




(a) Surface Plot for $\beta$-drift (b) Contour Plot for $\beta$-drift

Figure 3. Surface and contour plots of $\Delta_{ij}$ for $\beta' = \beta + .4$.

This shows that when an item difficulty parameter drifts, the effect is asymmetric with respect to
the inflection point of the ICC. Figure 4 shows a plot of a drifted item with original difficulty
value $\beta = 0$, drifted difficulty value $\beta' = \beta + .4 = .4$, and discrimination value $\alpha = 1$ to illustrate
this behavior. The labeling corresponds to that of Figure 2.



Figure 4. ICCs for item with $\beta$-drift of .4.
As previously seen in the surface and contour plots of Figure 3, the difference in response
probabilities gets larger in absolute magnitude as the discrimination value gets larger and items
with higher discrimination values have a higher bias for smaller location differences.

Finally, Figure 5 shows the $\Delta_{ij}$ surface and contour plots for a joint $\alpha$-drift of .5 and $\beta$-drift of
.4 for item discrimination values between 0 and 2, and location differences between -2 and 2 to
match the structure of Table A3 while again utilizing more grid points. The labeling corresponds
to that of Figure 1 and Figure 3.




(a) Surface Plot for joint $\alpha$- and $\beta$-drift (b) Contour Plot for joint $\alpha$- and $\beta$-drift

Figure 5. Surface and contour plots of $\Delta_{ij}$ for $\alpha' = \alpha + .5$ and $\beta' = \beta + .4$.

This plot shows the complex effects that both drift types have on the difference in response
probabilities and one can readily identify characteristics of the previous two cases. For example,
note the almost linear difference values for poorly discriminating items in the location difference
range considered here due to flat ICCs and the pronounced spike in difference values for highly
discriminating items similar to the cases before. As an example of the complex behavior of the
$\Delta_{ij}$ values, Figure 6 shows the ICC of an item with original parameter values $\beta = 0$, $\alpha = 1$, and
the ICC of the same item with drifted parameter values $\beta' = .4$ and $\alpha' = 1.5$. The values were
chosen to match the effects shown in Figure 2 and Figure 4 and the labeling is identical to these
figures as well:

Figure 6. Plot of ICCs for item with joint $\alpha$-drift of .5 and $\beta$-drift of .4.

These complex relationships raise the issue of what kind of discrimination values and 
location differences are typically observed in practice. It seems clear that extreme differences of, 
say, ± 2 or 3 units, can be observed in many practically relevant cases if test data are collected 
with item and examinee population subsets that yield a wide range of item and examinee 
parameter values. Indeed, a good test often consists of items with a wide variety of difficulty 
levels and a moderate range of discrimination values and is typically given to examinees with a 
wide range of ability levels with the implicit hope that item and examinee properties are well 
captured by the chosen model. Whether or not the intersection of a given examinee with a given 
item results in a large bias under drift of some parameter cannot be generally answered, however, 
and depends on the type and magnitude of drift. 

Conclusions 

This paper also underscored that parameter invariance is an ideal state that is technically 
violated if at least one identity condition does not hold for at least one examinee or item. 
Violations can be of any kind but three linear non-identity transformations were considered and 
the biases introduced on the logit scale by this LOI were represented with bias coefficients 
whenever possible. The bias coefficient framework primarily serves to highlight the 
dependencies of different types of bias on model parameters that have not drifted and it allows 
one to quickly gauge the severity of biases on the logit and probability scales. From a practical 
viewpoint, the different perspectives taken here allow one to compute and visualize different 
biases directly for a variety of conditions, without having to resort to simulation studies or real 
data sets. The framework can thus be used to cleanly assess the impact certain biases have on 
the response probabilities and examinee true scores; any real-life data set will be a mixed bag of 
different biases that falls somewhere between the clean analytical extremes. Most importantly, 
this paper and other research suggest that inferences about examinees based on IRT models are 
relatively robust toward moderate amounts of IPD across a wide range of theoretical conditions. 
It is hoped that this paper contributes to the ongoing process of clarifying what is meant by 
parameter invariance and of demystifying its status, which is often misperceived as a 
“mysterious” property that all IRT models seem to possess by definition across an almost 
infinite range of populations and conditions. If sound theoretical discussions about scientific 
generalizability are desired, this paper shows that the mathematical foundations of parameter 
invariance as a fundamental property of measurement cannot be ignored. 
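
The effect of drift on examinee true scores can be computed directly in the same spirit. The sketch below is illustrative only and rests on stated assumptions: a two-parameter logistic model without the 1.7 scaling constant, a hypothetical five-item test whose parameter values are invented for the example, and additive location drift of .4 on a single item. The true score is taken to be the sum of the item response probabilities.

```python
import math

def icc_2pl(theta, alpha, beta):
    """2PL response probability on the logistic metric (no 1.7 scaling constant)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def true_score(theta, items):
    """Expected number-correct score: the sum of the item response probabilities."""
    return sum(icc_2pl(theta, alpha, beta) for alpha, beta in items)

# Hypothetical five-item test as (alpha, beta) pairs; values are illustrative only.
original_items = [(0.8, -1.0), (1.0, -0.5), (1.2, 0.0), (1.0, 0.5), (0.8, 1.0)]

# Assumption: only the third item drifts, with its location shifted by +0.4.
drifted_items = list(original_items)
drifted_items[2] = (1.2, 0.4)

def true_score_bias(theta):
    """Drifted-minus-original true score at ability theta."""
    return true_score(theta, drifted_items) - true_score(theta, original_items)

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  true-score bias = {true_score_bias(theta):+.4f}")
```

Because only one item drifts in this example, the true-score bias can never exceed one score point in absolute value, and it is negative at every ability level because the drifted item has become harder.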



References 

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to 
errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 47-76. 

Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of 
item parameter drift. Journal of Educational Measurement, 25, 275-285. 

Casella, G., & Berger, R. L. (1990). Statistical inference. Belmont, CA: Duxbury Press. 

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially 
functioning items. Educational Measurement: Issues and Practice, 17, 31-45. 

Davey, T., Oshima, T. C., & Lee, K. (1996). Linking multidimensional item calibrations. 
Applied Psychological Measurement, 20, 405-416. 

Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter 
drift. Applied Psychological Measurement, 22, 33-51. 

Fox, J., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs 
sampling. Psychometrika, 66, 271-288. 

Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and 
possibilities. Journal of Educational Measurement, 20, 369-377. 

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. 
Linn (Ed.), Educational Measurement (pp. 147-200). New York: Macmillan. 

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response 
theory. Newbury Park, CA: Sage. 

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response 
theory and their applications to test development. Educational Measurement: Issues and 
Practice, 12, 38-47. 







Junker, B. W. (1999). Some statistical models and computational methods that may be useful for 
cognitively-relevant assessment. Unpublished manuscript. Available online at 
http://www.stat.cmu.edu/~brian/nrc/cfa 

Kaskowitz, G. S., & de Ayala, R. J. (2001). The effect of error in item parameter estimates on the 
test response function method of linking. Applied Psychological Measurement, 25, 39-52. 

Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT 
linking. Applied Psychological Measurement, 24, 115-138. 

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, 
NJ: Erlbaum. 

Rupp, A. A., & Zumbo, B. D. (2002). How to quantify and report whether parameter invariance 
holds: When Pearson correlations are not enough. Manuscript submitted for publication. 

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true 
bias/DIF from group ability differences and detects test bias/DTF as well as item 
bias/DIF. Psychometrika, 58, 159-194. 

Wells, C. S., Subkoviak, M. J., & Serlin, R. C. (2002). The effect of item parameter drift on 
examinee ability estimates. Applied Psychological Measurement, 26, 77-87. 

van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. 
New York: Springer-Verlag. 

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning 
(DIF): Logistic regression modeling as a unitary framework for binary and Likert-type 
(ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and 
Evaluation, Department of National Defense. 







Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications 
for translating language tests. Language Testing, 20, 136-147. 






Appendix 




[Three rotated tables appeared here in the original. Their numerical entries could not be 
recovered from the extracted text; the only legible labels are a "Discrimination Parameter" 
heading on each table and a trailing "Note." line.] 


first part cannot be compactly written using bias coefficients and5“ = - m is again an additive bias coefficient whose magnitude 